
Conversation

@rattus128
Contributor

@rattus128 rattus128 commented Jan 13, 2026

To try it:

pip install -r requirements.txt
and launch ComfyUI with --fast dynamic_vram

NOTE: This work does not have any GGUF integration and GGUF will not see any benefits yet.

NOTE: I am aware of increased Windows RAM usage when not configuring a pagefile, due to commit-quota exhaustion. If anyone is testing, please stay tuned for a major fix to Windows RAM usage incoming. The VRAM stuff on Windows is still testable. Linux is unaffected. (FIXED)

If you try it, please reply to the PR (if it hasn't been merged) with any issues, or feel free to make an issue ticket for bigger test cases with logs and numbers.

Features

  • A new ModelPatcher implementation which backs onto comfy-aimdo to implement varying model load levels that can be adjusted during model use. The patcher defers all load processes, lazily loading the model during use (e.g. the first step of a KSampler), and automatically negotiates a load level during inference to maximize VRAM usage without OOMing. If inference requires more VRAM than is available, weights are offloaded to make space before the OOM happens.

  • This will eventually allow for development of ComfyUI without needing to estimate model VRAM usage at all.

  • Large RAM and Windows commit-charge savings. There is no need to load models fully into RAM. This also gives a much higher chance of having the model in disk cache, saving the user from a disk-load delay on first run, as there is no longer a primary load into process memory displacing the disk cache.

  • Windows GPU shared memory usage avoidance

  • A deep copy of the model is eliminated from the safetensors save process (incidental improvement)

  • Reduced VRAM usage in the async offload stream when cuda malloc is disabled (pre-requisite improvement)

Implementation Details

Aimdo readme here: https://pypi.org/project/comfy-aimdo/

The long story on RAM: Aimdo's ability to just evict weights means it's no longer possible to .to() a weight back and forth from the GPU. VRAM pressure can occur at any time during inference and there is no clean way to .to() weights or modules back to the CPU while pytorch is stacked in the middle of a pending VRAM allocation. So, since we can never .to() a weight, we instead take the opportunity to leave the model parameter as known to pytorch on the CPU permanently, using assign=True state dict loading. Since it is never write-touched, it lives in mmap permanently and never consumes any process-allocated RAM. Several community developers have already flagged this as a possible major enhancement to Comfy, and the needed changes to model load and unload align with the VRAM problems.
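Roughly, the idea looks like this (a minimal illustrative sketch, not the code in this PR; the file name and module are placeholders and it assumes the file's keys match the module's parameter names):

```python
import torch
from safetensors.torch import load_file

# Placeholder module; the state dict below must contain a matching "weight" key.
model = torch.nn.Linear(4096, 4096, bias=False)

# load_file() memory-maps the file and returns CPU tensors backed by that mapping.
sd = load_file("flux2_block.safetensors", device="cpu")

# assign=True makes the module adopt the state-dict tensors directly as its
# parameters instead of copying into freshly allocated RAM, so as long as the
# weights are never write-touched they stay backed by the mmap.
model.load_state_dict(sd, assign=True)
```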

(NEW) Windows has extra RAM complications with its pessimistic allocation and the fact that it forbids overcommit other than via the pagefile. Two changes are made to drastically reduce commit charge. Linear nn.Modules are now constructed without the placeholder weight, as this consumes commit charge. The other change is a lightweight safetensors load that maps files in READ mode (the safetensors package uses CoW), which avoids getting commit-charged for the whole model on file load.
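A simplified sketch of a READ-mode safetensors mapping (illustrative only; the dtype table is partial, offsets are assumed aligned, and this is not the actual comfy/utils.py implementation):

```python
import json
import mmap
import struct

import torch

def load_safetensors_readonly(path):
    # Map the file with ACCESS_READ (shared, read-only). Unlike a copy-on-write
    # mapping, this does not reserve Windows commit charge for the whole file.
    with open(path, "rb") as f:
        header_size = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_size))
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # PyTorch warns that the buffer is not writable; we never write through it.
    data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]

    # Partial dtype table for illustration; assumes data offsets are aligned
    # for each dtype.
    dtypes = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        start, end = info["data_offsets"]
        tensors[name] = data_area[start:end].view(dtypes[info["dtype"]]).reshape(info["shape"])
    return tensors
```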

As for loading the weight onto the GPU, that happens via comfy_cast_weights, which is now used in all cases. cast_bias_weight checks whether the VBAR assigned to the model has space for the weight (based on the same load-priority semantics as the original ModelPatcher). If it does, the VRAM returned by the Aimdo allocator is used as the GPU-side parameter. The caster is responsible for populating the weight data. This is done using the usual offload_stream (which means we now have asynchronous load overlapping first-use compute).
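In outline, the cast path behaves something like this (a hypothetical sketch only; `vbar.try_reserve` stands in for asking the aimdo-managed VRAM arena for space and is not the real comfy-aimdo API):

```python
import torch

def cast_weight_dynamic(layer, offload_stream, vbar):
    # Hypothetical outline of the decision described above.
    weight = layer.weight                         # CPU-side, mmap-backed parameter
    nbytes = weight.numel() * weight.element_size()

    gpu_bytes = vbar.try_reserve(nbytes)          # assumed: uint8 CUDA tensor or None
    if gpu_bytes is None:
        return weight                             # no VRAM headroom: stay offloaded

    # Populate the reserved VRAM on the offload stream so the copy overlaps
    # with first-use compute on the main stream (truly async only if the
    # source is pinned).
    with torch.cuda.stream(offload_stream):
        gpu_weight = gpu_bytes.view(weight.dtype).reshape(weight.shape)
        gpu_weight.copy_(weight, non_blocking=True)
    torch.cuda.current_stream().wait_stream(offload_stream)
    return gpu_weight
```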

Pinning works a little differently. When a weight is detected during load as unable to fit, a pin is allocated at the time of casting and the weight as used by the layer is DMA'd back to the pin using the GPU DMA TX engine, also on the asynchronous offload streams. This means you get to pin the Lora-modified and requantized weights, which can be a major speedup for offload+quantize+lora use cases. This works around the JIT Lora + FP8 exclusion and brings FP8MM to heavy offloading users (who probably really need it with more modest GPUs). There is a performance risk in that a CPU+RAM patch has been replaced with a GPU+RAM patch, but my initial performance results look good. Most users are likely to have a GPU that outruns their CPU in these woods.
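Conceptually, the pin-back step looks something like this (illustrative sketch; the names are made up, not the PR's code):

```python
import torch

def pin_back(gpu_weight, offload_stream):
    # Once the Lora-patched / requantized weight exists on the GPU, copy it
    # back into pinned host memory over the DMA engine so the next upload
    # starts from the patched form rather than the raw file weight.
    pinned = torch.empty(gpu_weight.shape, dtype=gpu_weight.dtype,
                         device="cpu", pin_memory=True)
    with torch.cuda.stream(offload_stream):
        pinned.copy_(gpu_weight, non_blocking=True)   # async device-to-host DMA
    # Caller must synchronize offload_stream before reading `pinned`.
    return pinned
```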

Some common code is written to consolidate a layer's tensors for aimdo mapping, pinning, and DMA transfers. interpret_gathered_like() allows unpacking a raw buffer as a set of tensors. This is used consistently to bundle and pack weights, quantization metadata (QuantizedTensor bits) and biases into one payload for DMA in the load process, reducing CUDA overhead a little. Some quantization metadata was missing async offload in some cases; that is now added. This also pins quantization metadata and consolidates cuda_host_register calls (which can be expensive).
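The gather/unpack idea, in sketch form (an illustrative re-implementation, not the PR's interpret_gathered_like(); it assumes each tensor's byte offset stays aligned to its element size):

```python
import torch

def gather_tensors(tensors):
    # Pack weight, bias and quant metadata into one flat uint8 buffer so they
    # can be pinned and DMA'd as a single transfer.
    return torch.cat([t.contiguous().view(torch.uint8).flatten() for t in tensors])

def interpret_gathered(flat, like):
    # Re-view the flat buffer as tensors with the dtypes/shapes of `like`.
    # Assumes each tensor's byte offset is aligned to its element size.
    out, offset = [], 0
    for t in like:
        nbytes = t.numel() * t.element_size()
        out.append(flat[offset:offset + nbytes].view(t.dtype).reshape(t.shape))
        offset += nbytes
    return out
```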

Model saving is reworked to avoid the force_cast_weights flag, which doesn't make sense in ModelPatcherDynamic. This rework was able to cut a RAM copy of the model by doing on-the-fly model patching during the save process, which worked out to be a nice RAM saving while fixing my API problem.

Aimdo (under the hood) links with Windows APIs to adjust load levels based on the WDDM target VRAM usage rather than using the numbers reported by the pytorch/CUDA stack (which are WDDM's lies). This means as soon as shared memory spilling occurs on Windows, weights will be unloaded until you get out of the spill state, and inference state will move back to VRAM.

Offload streams now have an accompanying single shared cast buffer that grows as needed. This is to avoid significant waste and fragmentation in the cast buffers when offloading multiple weight sizes, as we don't have cuda_malloc and the pytorch allocator completely isolates memory by stream. So we go a little hands-on at the low level to keep those allocation pools minimized. This is also applied to the non --dynamic_vram case when cuda_malloc is not in use, as it does reduce VRAM, especially on Flux2 with those huge and varying weights.
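A minimal sketch of the grow-only shared cast buffer idea (illustrative, not the PR's code):

```python
import torch

class GrowOnlyCastBuffer:
    # One reusable cast buffer per offload stream: rather than letting the
    # allocator cache a separate block for every weight size it has seen on
    # that stream, keep a single buffer and only grow it when a larger weight
    # comes along.
    def __init__(self, device):
        self.device = device
        self.buffer = torch.empty(0, dtype=torch.uint8, device=device)

    def get(self, nbytes):
        if self.buffer.numel() < nbytes:
            self.buffer = torch.empty(nbytes, dtype=torch.uint8, device=self.device)
        return self.buffer[:nbytes]
```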

Future Work

  • Pin RAM management could use more optimization: try to get pins deprioritized behind the mmap of active models under RAM pressure to allow much more aggressive pin retention. The heuristic (currently free model x2) can also just be straight improved.
  • First iterations are slower than I'd hoped. Some multi-threading of the CPU/RAM bottleneck might allow for further run-ahead and bottleneck saturation.
    - The progress meter needs some work. It's jarring to have it stall on the first iteration when it's doing a slow model load. (DONE)

Example Test case:

Flux2 + Lora text to image.
RTX 5090 with 8GB of VRAM consumed by a non-Comfy application (24GB effective)
PCIE5 NVME, 96GB RAM.
Disk caches warm with model

image

Before:

________________________________________________________________________
Starting server

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 21458.36 MB usable, 17180.59 MB loaded, full load: True
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
loaded partially; 19731.54 MB usable, 18720.00 MB loaded, 15093.00 MB offloaded, 1152.00 MB buffer reserved, lowvram patches: 72
Initializing ControlAltAI Nodes
100%|██████████| 20/20 [00:25<00:00,  1.27s/it]
Requested to load AutoencoderKL
Unloaded partially: 1152.00 MB freed, 17568.00 MB remains loaded, 1152.00 MB buffer reserved, lowvram patches: 80
loaded completely; 190.98 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 39.58 seconds

General Memory Usage

image

Peak VRAM:

image

After (--fast dynamic_vram)

________________________________________________________________________
Starting server

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 17180MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 138 patches attached.
Initializing ControlAltAI Nodes
100%|██████████| 20/20 [00:28<00:00,  1.41s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 34.79 seconds

General Memory usage:

image

Peak VRAM:

image

More test data to come. Most workflows I have run are faster with this.

I'm testing various things, updates, bugfixes etc., but enough works for a PR.

@socket-security

socket-security bot commented Jan 13, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | comfy-aimdo@0.1.2 | 100 | 100 | 100 | 100 | 70 |

View full report

@rattus128 rattus128 marked this pull request as draft January 13, 2026 10:29
@MeiYi-dev

MeiYi-dev commented Jan 13, 2026

As a 16GB VRAM user using the LTX 2 model, the main issue for me currently is that before VAE decoding occurs, the whole model gets offloaded to RAM, which is already loaded with the TEs/VAE/Latent Upscalers, etc., so it gets overloaded and spools onto the pagefile. In reality, the max VRAM use of the VAE decoding part is 4GB with the "VAE decoded tiled" node. Unloading the whole model (probably because of VAE estimations within ComfyUI not accounting for tiled decoding) is the biggest issue I have found, and is the only reason why us 16GB VRAM / 32GB RAM users are experiencing issues where the TE reloads from disk (because it got unloaded to make space for the model) after changing the prompt and the model reloads again, both of these contributing to huge slowdowns.

@rattus128
Contributor Author

rattus128 commented Jan 13, 2026

As a 16GB VRAM user using the LTX 2 model, the main issue for me currently is that before VAE decoding occurs, the whole model gets offloaded to RAM, which is already loaded with the TEs/VAE/Latent Upscalers, etc., so it gets overloaded and spools onto the pagefile. In reality, the max VRAM use of the VAE decoding part is 4GB with the "VAE decoded tiled" node. Unloading the whole model (probably because of VAE estimations within ComfyUI not accounting for tiled decoding) is the biggest issue I have found, and is the only reason why us 16GB VRAM / 32GB RAM users are experiencing issues where the TE reloads from disk (because it got unloaded to make space for the model) after changing the prompt and the model reloads again, both of these contributing to huge slowdowns.

This loader doesn't unload back to RAM at all, so it won't spill to the pagefile. The idea is: if you don't have enough RAM, just dump it, because it's faster to read it from the file on disk again than to write to and read from the pagefile. If you do have enough RAM, the OS will just leave the model in disk cache from the first load. So this should be faster for you.

Your margins are very low; you might do well with --disable-pinned-memory, but if you try it, try it both ways.

Kudos for LTX2 on 16GB; these performance points are what I'm really trying to make work here.

@MeiYi-dev

MeiYi-dev commented Jan 13, 2026

As a 16GB VRAM user using the LTX 2 model, the main issue for me currently is that before VAE decoding occurs, the whole model gets offloaded to RAM, which is already loaded with the TEs/VAE/Latent Upscalers, etc., so it gets overloaded and spools onto the pagefile. In reality, the max VRAM use of the VAE decoding part is 4GB with the "VAE decoded tiled" node. Unloading the whole model (probably because of VAE estimations within ComfyUI not accounting for tiled decoding) is the biggest issue I have found, and is the only reason why us 16GB VRAM / 32GB RAM users are experiencing issues where the TE reloads from disk (because it got unloaded to make space for the model) after changing the prompt and the model reloads again, both of these contributing to huge slowdowns.

This loader doesn't unload back to RAM at all, so it won't spill to the pagefile. The idea is: if you don't have enough RAM, just dump it, because it's faster to read it from the file on disk again than to write to and read from the pagefile. If you do have enough RAM, the OS will just leave the model in disk cache from the first load. So this should be faster for you.

Your margins are very low; you might do well with --disable-pinned-memory, but if you try it, try it both ways.

Kudos for LTX2 on 16GB; these performance points are what I'm really trying to make work here.

Loading the model file from disk again does seem to be a nice way to prevent useless writing to the pagefile, at least. It will be very useful for 16GB RAM users. I use the model without any changes to startup args, with GGUFs. It works perfectly, though the only issue currently is offloading the whole model to RAM and the pagefile to make space for 4GB haha

@zwukong

zwukong commented Jan 13, 2026

The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1GB of VRAM. Maybe you should take a look.

@FurkanGozukara

Awesome, I hope this gets implemented.

@MeiYi-dev

MeiYi-dev commented Jan 13, 2026

The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1GB of VRAM. Maybe you should take a look.

I am looking at the screenshots of this node; where does the model offload to? The RAM use doesn't increase even with the model only using 6GB VRAM, what's the caveat?

Edit: NVM it offloads to RAM

@MeiYi-dev

MeiYi-dev commented Jan 13, 2026

The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1GB of VRAM. Maybe you should take a look.

This PR doesn't do any offloading to RAM like the node you mentioned. This PR just drops the model if enough RAM space is not found, and loads the files for each run (I think) using a faster way.

TLDR, it prevents useless writing to pagefile

@zwukong

zwukong commented Jan 13, 2026

1. Large RAM savings. In your pic, saving about 1/2 😲
2. Windows GPU shared memory usage avoidance. This is the main pain point, which causes extremely slow speed.
If it is true, ComfyUI will be the god 😄

@zwukong

zwukong commented Jan 13, 2026

I tested some results in Qwen Edit using GGUF. Speed is the same, RAM almost the same, VRAM seems more stable than before.

LTX2 GGUF is very bad. RAM usage is higher, speed is 1/4, VRAM is full.

reserve-vram is useless now, not good

@Kosinkadink
Member

@zwukong note, gguf is not officially supported in ComfyUI and requires the use of a custom node pack, which at this moment does not account for anything changed in this PR since it's so new. For testing, please don't use gguf models at this time!

@anr2me

anr2me commented Jan 13, 2026

The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1GB of VRAM. Maybe you should take a look.

This PR doesn't do any offloading to RAM like the node you mentioned. This PR just drops the model if enough RAM space is not found, and loads the files for each run (I think) using a faster way.

TLDR, it prevents useless writing to pagefile

That is practically similar to what --normalvram does 😅

I usually use normalvram to minimize both RAM & VRAM usage. Unlike lowvram, which forcefully stores the model in RAM after use, or highvram, which forcefully keeps the model in VRAM after use, normalvram will free the model after use when there is not enough free memory, thus avoiding swap file usage.

@zwukong

zwukong commented Jan 14, 2026

GGUF should be your priority I think; 4090/5090 are only 10%, 90% use GGUF with under 20GB VRAM. Even kj uses GGUF too, and he has a 5090

@rattus128
Contributor Author

rattus128 commented Jan 14, 2026

The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1GB of VRAM. Maybe you should take a look.

This PR doesn't do any offloading to RAM like the node you mentioned. This PR just drops the model if enough RAM space is not found, and loads the files for each run (I think) using a faster way.
TLDR, it prevents useless writing to pagefile

That is practically similar to what --normalvram does 😅

I usually use normalvram to minimize both RAM & VRAM usage. Unlike lowvram, which forcefully stores the model in RAM after use, or highvram, which forcefully keeps the model in VRAM after use, normalvram will free the model after use when there is not enough free memory, thus avoiding swap file usage.

Don't think this flag does anything these days:

rattus@rattus-box2:~/ComfyUI$ git grep normalvram
comfy/cli_args.py:vram_group.add_argument("--normalvram", action="store_true", help="Used to force normal vram use if lowvram gets automatically enabled.")

^^ No users in code search.

lowvram semantics has been the default for a while, as models got too big to assume that users could fit them in VRAM by default.

@kijai
Contributor

kijai commented Jan 14, 2026

GGUF should be your priority I think; 4090/5090 are only 10%, 90% use GGUF with under 20GB VRAM. Even kj uses GGUF too, and he has a 5090

I certainly do not use GGUF on 5090 or even 4090, that would be missing out on fp8 matmuls (the fast mode).

The common misconception seems to be that you have to use GGUF for low VRAM systems, which isn't true when offloading exists and is rather effective in ComfyUI currently. This PR would make it work even better.

Of course on low RAM systems there are fewer options for offloading and GGUF becomes useful.

@zwukong

zwukong commented Jan 14, 2026

Yes, mainly for low RAM, and less than 1/4 the size. FP8 or FP16 eats too much RAM, at least twice as much. So if your text encoder and unet are both fp8, then it's a quarter of the size. And now RAM and SSDs are more expensive than the video card 😄

@asagi4
Contributor

asagi4 commented Jan 14, 2026

comfy_aimdo seems to break ROCm even when this is not in use. It tries to dynamically load libcuda.so.1, which does not exist.

@isaac-mcfadyen

Are there plans to open-source comfy-aimdo (similar to what was done with comfy-kitchen)?

@comfyanonymous
Member

comfy-aimdo will be open sourced before this pull request is merged.

For gguf we will try not to break it but we are focusing on improving our own native quant system to make it better/faster than gguf.

@FurkanGozukara

comfy-aimdo will be open sourced before this pull request is merged.

For gguf we will try not to break it but we are focusing on improving our own native quant system to make it better/faster than gguf.

Any plans for converting existing models into NVFP4? I tried to convert FLUX SRPO and managed it, but quality dropped sharply.

@zwukong

zwukong commented Jan 15, 2026

@comfyanonymous I know it is FP4, but 40-series cards cannot run it. We need INT4 as well, like nunchaku does. The reason GGUF is the best for now is quality and size: even Q2 can get pretty good results, and Q3 and above are almost the same as FP16.

@Kosinkadink
Member

There is a PR for int4. This PR is for memory management improvements with comfy-aimdo; it would be appreciated if this thread were used for testing this PR and not feature requests.

@zwukong

zwukong commented Jan 15, 2026

Not just feature requests: when I tested this PR, GGUF could not benefit. So I want GGUF to be supported too. Most of us use the great GGUFs. Almost all my models (about 99%) are GGUF.

@thrnz

thrnz commented Jan 15, 2026

Would the Windows shared memory avoidance stuff have any effect when using WSL?

If not, and with the changes now maximising VRAM usage (--reserve-vram seemingly no longer functions with dynamic_vram enabled), would that make spilling over into shared memory more likely with WSL?

I've noticed some slowdowns/stalls on subsequent runs along with the normal signs of shared memory sluggishness (low temperatures, low power draw, 100% GPU usage, along with reported shared memory usage in task manager) when testing out the PR, and wonder if the changes might not be suited for WSL as-is.

@RandomGitUser321
Contributor

Would the Windows shared memory avoidance stuff have any effect when using WSL?

By default, Windows will only allow 1/2 of your system memory to be used by WSL (without modifying the .wslconfig with memory=24GB or whatever you want to set it to), so you're already going to run into issues quickly. But as far as I know, ComfyUI should pick up on that value, and if it does, then any other memory management math should also pick up on it as well. Though at the GPU driver level, I'm not sure how they handle shared memory when used with WSL.

@rattus128
Contributor Author

Would the Windows shared memory avoidance stuff have any effect when using WSL?

If not, and with the changes now maximising VRAM usage (--reserve-vram seemingly no longer functions with dynamic_vram enabled), would that make spilling over into shared memory more likely with WSL?

I've noticed some slowdowns/stalls on subsequent runs along with the normal signs of shared memory sluggishness (low temperatures, low power draw, 100% GPU usage, along with reported shared memory usage in task manager) when testing out the PR, and wonder if the changes might not be suited for WSL as-is.

You are right that I ignore --reserve-vram for the moment. It can be implemented with a bit of plumbing and I'll take it as a feature request (along with --novram), but we might not do that one in V1 as you can just opt out in the interim.

Yeah, so WSL is actually a big problem and very difficult (maybe impossible) to fix with regards to shared memory spilling. When you are under WSL you will present as Linux to aimdo, which won't have its anti-spill in play, as that is Windows-specific. Even if we could detect WSL we would not have access to the APIs needed to detect the spill, as they are only visible on the host Windows.

WSL has value from a Linux-familiarity point of view and solves some software packaging problems, but unfortunately the extra layer of indirection between Comfy and the GPU creates multiple performance problems. If you're optimizing Comfy performance and like the Linux env, I VERY strongly recommend a dual-boot setup, as I have observed major performance differences in offloading setups where Linux just beats Windows with all other variables held the same (I dual boot my day-to-day test machine between Ubuntu and Win11).

Sync before deleting anything.
This is needed for aimdo, where the cache can't self-recover from
fragmentation. It is however a good thing to do anyway after an OOM,
so make it unconditional.
Be more tolerant of unsupported platforms and fall back properly.
Fixes a crash when CUDA is not installed at all.
If running on Windows, defer creation of the layer parameters until the state
dict is loaded. This avoids a massive Windows commit-charge spike
when a model is created but not loaded.

This problem doesn't exist on Linux, as Linux allows RAM overcommit;
Windows does not. Before the dynamic memory work this was also a non-issue,
as every non-quant model would just immediately RAM-load and need the memory
anyway.

Make the workaround Windows-specific, as there may be someone out there with
some training-from-scratch workflow (which this might break), and assume said
someone is on Linux.
The mmap as used by safetensors is hardcoded to CoW, which forcibly
consumes Windows commit charge on a zero copy. RIP. Implement safetensors
loading in pytorch itself with a READ mmap so we don't get commit-charged for all our
open models.
This isn't worth it, and the likelihood of inference leaving a complex
data structure with cyclic references behind is low. Remove it.

We would replace it with a condition on nodes that actually touch the
GPU, which might be a win.
This is needed for deepcopy construction. We shouldn't really have deep
copies of MP or MODynamic, however there is a stray one in some controlnet
flows.
@rattus128
Contributor Author

rebased to 0fc1570 (v0.10.0 +9)

@mohtaufiq175

@rattus128

I tried it with Flux 2 Klein 9B, using FP16 & qwen_3_8b_fp8mixed, running it on Windows 11, and it seems the RAM/memory usage is still high?

Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11

launch command

python main.py --use-sage-attention --fast fp16_accumulation dynamic_vram --preview-method latent2rgb --disable-api-nodes

Result with dynamic vram

image image

Without dynamic vram

image image

Also I have these logs; they are from a different run than the screenshot above (with nvitop not running in the background), as I also wanted to compare the speed difference.

Dynamic Vram Logs
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
  data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
got prompt
got prompt
got prompt
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:24<00:00,  6.18s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 54.60 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00,  6.92s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 27.38 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00,  6.25s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.01 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:24<00:00,  6.06s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 25.40 seconds
Without Dynamic Vram Logs
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
got prompt
got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
loaded completely; 5639.68 MB usable, 160.31 MB loaded, full load: True
got prompt
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded partially; 5615.68 MB usable, 5278.84 MB loaded, 2984.51 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
loaded partially; 5428.44 MB usable, 5091.59 MB loaded, 3171.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
loaded partially; 1485.95 MB usable, 612.02 MB loaded, 16704.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00,  4.34s/it]
Requested to load AutoencoderKL
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 160.31 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
Prompt executed in 61.67 seconds
Requested to load Flux2
loaded partially; 1483.95 MB usable, 612.02 MB loaded, 16704.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.14s/it]
Requested to load AutoencoderKL
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 160.31 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
Prompt executed in 24.57 seconds
Requested to load Flux2
loaded partially; 1483.95 MB usable, 612.02 MB loaded, 16704.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.14s/it]
Requested to load AutoencoderKL
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 160.31 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
Prompt executed in 24.62 seconds
Requested to load Flux2
loaded partially; 1483.95 MB usable, 612.02 MB loaded, 16704.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.12s/it]
Requested to load AutoencoderKL
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 160.31 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
Prompt executed in 24.42 seconds

Now that the model defined dtype is decoupled from the state_dict
dtypes we need to be able to handle worst case scenario casts between
the SD and VBAR.
Scan created models and save off the dtypes as defined by the model
creation process. This is needed for assign=True, which will override
the dtypes.
If the model defines a dtype that is different to what is in the state
dict, respect that at load time. This is done as part of the casting
process.
@rattus128
Contributor Author

rattus128 commented Jan 21, 2026

@rattus128

I tried it with Flux 2 Klein 9B, using FP16 & qwen_3_8b_fp8mixed, running it on Windows 11, and it seems the RAM/memory usage is still high? […]

Thanks for the test. Can you confirm the version of the PR you tried (just type "git show")? I'm making changes every day as things come in, and if I can associate this data with a specific revision that helps. Can I get your PCIe bus width and generation?

I am very very interested in your data if you do exactly the same setup with --disable-pinned-memory, both for your memory numbers and execution times.

The longer story: Your RAM consumption as reported by nvitop is usually nothing to worry about, as it's measuring utilization as opposed to committed memory. Committed memory exhaustion is the one that OOMs and crashes systems. Open Task Manager and have a look at the memory page and you will see the "Committed" number. This should be lower with the PR. In this PR the model remains in RAM, but as a soft uncommitted allocation which Windows will automatically free if the system comes under RAM pressure (i.e. it's not committed). Because you just load and use the same big model 4 times, this just flatlines at the peak, which is fine. The pinned memory is however fully committed and a separate allocation. So if you have the RAM space it will keep around both the pinned copy and the original copy of the model, and nvitop will count both.

@mohtaufiq175

@rattus128 On the previous test, it was on commit 96e5d45

Anyway, here is a new test run with the latest changes, 2d96b2f. Sorry if it's messy lol.
image
GPU-Z 2 68 0_gLI5ps97V2

1. Without dynamic_vram

python main.py --use-sage-attention --fast fp16_accumulation --preview-method latent2rgb --disable-api-nodes

Logs
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.459s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
loaded completely; 2963.00 MB usable, 160.31 MB loaded, full load: True
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded partially; 5631.80 MB usable, 5294.59 MB loaded, 2968.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
loaded partially; 5465.60 MB usable, 5129.60 MB loaded, 3133.51 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
loaded partially; 1090.13 MB usable, 226.02 MB loaded, 17090.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.15s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 50.00 seconds
got prompt
loaded partially; 1088.13 MB usable, 224.02 MB loaded, 17092.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00,  4.26s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 23.51 seconds

Peak Committed Memory

explorer_bC6l6EosCY

Second run

explorer_tfJcDZxxLS

2. With dynamic_vram & pinned memory enabled.

python main.py --use-sage-attention --fast fp16_accumulation dynamic_vram --preview-method latent2rgb --disable-api-nodes

Logs
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.433s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
  data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
  return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00,  6.27s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 60.80 seconds
got prompt
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00,  6.82s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.64 seconds

The peak committed memory and the second run are similar, both stayed at around 47 GB.

Taskmgr_xGZ0rXAFwU

3. With dynamic_vram & --disable-pinned-memory

python main.py --use-sage-attention --fast fp16_accumulation dynamic_vram --preview-method latent2rgb --disable-api-nodes --disable-pinned-memory

Logs
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.458s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
  data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
  return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00,  6.78s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 43.50 seconds
got prompt
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00,  6.47s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 27.09 seconds

The peak committed memory and the second run are similar; both stayed at around 26 GB.

explorer_5Ehj3UB39T

==============================================================================

Anyway, just in case, this is a separate run of the same workflow with the batch count set to 4 to check the overall execution time. From a casual Comfy user's perspective, the current Comfy implementation still seems faster, even though it uses more committed memory?

  1. python main.py --use-sage-attention --fast fp16_accumulation --preview-method latent2rgb --disable-api-nodes
Logs
python main.py --use-sage-attention --fast fp16_accumulation --preview-method latent2rgb --disable-api-nodes
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.433s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
got prompt
got prompt
Requested to load AutoencoderKL
loaded completely; 2963.00 MB usable, 160.31 MB loaded, full load: True
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded partially; 5631.80 MB usable, 5294.59 MB loaded, 2968.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
loaded partially; 5425.45 MB usable, 5088.59 MB loaded, 3174.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
loaded partially; 1090.13 MB usable, 226.02 MB loaded, 17090.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.24s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 47.55 seconds
loaded partially; 1088.13 MB usable, 224.02 MB loaded, 17092.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00,  3.99s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 21.86 seconds
loaded partially; 1086.13 MB usable, 196.02 MB loaded, 17120.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.04s/it]
Requested to load AutoencoderKL
loaded completely; 1098.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 22.09 seconds
loaded partially; 1056.13 MB usable, 192.02 MB loaded, 17124.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00,  3.99s/it]
Requested to load AutoencoderKL
loaded completely; 1098.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 21.85 seconds
  2. python main.py --use-sage-attention --fast fp16_accumulation dynamic_vram --preview-method latent2rgb --disable-api-nodes
Logs
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.442s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
  data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
got prompt
got prompt
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
  return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:22<00:00,  5.70s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 47.82 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00,  6.42s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 25.11 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:22<00:00,  5.73s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 23.81 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.87s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 24.12 seconds
  3. python main.py --use-sage-attention --fast fp16_accumulation dynamic_vram --preview-method latent2rgb --disable-api-nodes --disable-pinned-memory
Logs
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static

Import times for custom nodes:
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.520s (created=0, skipped_existing=1689, total_seen=1692)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
  data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
got prompt
got prompt
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
  return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:26<00:00,  6.55s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 42.52 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00,  6.33s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.37 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:24<00:00,  6.24s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.09 seconds
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:24<00:00,  6.23s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.08 seconds

@comfy-pr-bot
Member

Test Evidence Check

@isaac-mcfadyen

> Test Evidence Check

Not sure if there's a better place to raise this, but this bot has commented on most (every?) PR with this message even when there are no issues in the description, which means quite a lot of noise (an email from GitHub per subscribed PR).

@rattus128
Contributor Author

> @rattus128 On the previous test, it was on commit 96e5d45
>
> Anyway, Here is a new test run with the latest changes [2d96b2f](https://github.com/Comfy-Org/ComfyUI/

Thanks for this data. It's definitely worth looking into and I am tracking it here: rattus128#2

I'll look into it when I get a chance. I have a system pretty similar to yours (3060+64GB), so hopefully I can cleanly reproduce it.
